Netid: nc230
Note: this assignment falls under collaboration Mode 2: Individual Assignment – Collaboration Permitted. Please refer to the syllabus for additional information.
Instructions for all assignments can be found here, and is also linked to from the course syllabus.
Total points in the assignment add up to 90; an additional 10 points are allocated to presentation quality.
Through completing this assignment you will be able to...
You've learned a great deal about supervised learning and now it's time to bring together all that you've learned. You will be competing in a Kaggle Competition along with the rest of the class! Your goal is to predict hotel reservation cancellations based on a number of potentially related factors such as lead time on the booking, time of year, type of room, special requests made, number of children, etc. While you will be asked to take certain steps along the way to your submission, you're encouraged to try creative solutions to this problem and your choices are wide open for you to make your decisions on how to best make the predictions.
You can view the public leaderboard anytime here
The Data. The dataset is provided as a5_q1.pkl which is a pickle file format, which allows you to load the data directly using the code below; the data can be downloaded from the Kaggle competition website. A data dictionary for the project can be found here and the original paper that describes the dataset can be found here. When you load the data, 5 matrices are provided X_train_original, y_train, and X_test_original, which are the original, unprocessed features and labels for the training set and the test features (the test labels are not provided - that's what you're predicting). Additionally, X_train_ohe and X_test_ohe are provided which are one-hot-encoded (OHE) versions of the data. The OHE versions OHE processed every categorical variable. This is provided for convenience if you find it helpful, but you're welcome to reprocess the original data other ways if your prefer.
Scoring. You will need to achieve a minimum acceptable level of performance to demonstrate proficiency with using these supervised learning techniques. Beyond that, it's an open competition and scoring in the top three places of the private leaderboard will result in 5 bonus points in this assignment (and the pride of the class!). Note: the Kaggle leaderboard has a public and private component. The public component is viewable throughout the competition, but the private leaderboard is revealed at the end. When you make a submission, you immediately see your submission on the public leaderboard, but that only represents scoring on a fraction of the total collection of test data, the rest remains hidden until the end of the competition to prevent overfitting to the test data through repeated submissions. You will be be allowed to hand-select two eligible submissions for private score, or by default your best two public scoring submissions will be selected for private scoring.
(a) Explore your data. Review and understand your data. Look at it; read up on what the features represent; think through the application domain; visualize statistics from the paper data to understand any key relationships. There is no output required for this question, but you are encouraged to explore the data personally before going further.
(b) Preprocess your data. Preprocess your data so it's ready for use for classification and describe what you did and why you did it. Preprocessing may include: normalizing data, handling missing or erroneous values, separating out a validation dataset, preparing categorical variables through one-hot-encoding, etc. To make one step in this process easier, you're provided with a one-hot-encoded version of the data already.
(c) Select, train, and compare models. Fit at least 5 models to the data. Some of these can be experiments with different hyperparameter-tuned versions of the same model, although all 5 should not be the same type of model. There are no constraints on the types of models, but you're encouraged to explore examples we've discussed in class including:
Scikit-learn offers a number of tools for assisting with this including those for bagging, boosting, and stacking. You're also welcome to explore options beyond the sklean universe; for example, some of you may have heard of XGBoost which is a very fast implementation of gradient boosted decision trees that also allows for parallelization.When selecting models, be aware that some models may take far longer than others to train. Monitor your output and plan your time accordingly.
Assess the classification performance AND computational efficiency of the models you selected:
(d) Apply your model "in practice". Make at least 5 submissions of different model results to the competition (more submissions are encouraged and you can submit up to 5 per day!). These do not need to be the same that you report on above, but you should select your most competitive models.
pipeline module which lets you group a series of processing steps together.Below is some code for (1) loading the data and (2) once you have predictions in the form of confidence scores for those classifiers, to produce submission files for Kaggle.
import pandas as pd
import numpy as np
import pickle
################################
# Load the data
################################
data = pickle.load( open( "./data/a5_q1.pkl", "rb" ) )
y_train = data['y_train']
X_train_original = data['X_train'] # Original dataset
X_train_ohe = data['X_train_ohe'] # One-hot-encoded dataset
X_test_original = data['X_test']
X_test_ohe = data['X_test_ohe']
################################
# Produce submission
################################
def create_submission(confidence_scores, save_path):
'''Creates an output file of submissions for Kaggle
Parameters
----------
confidence_scores : list or numpy array
Confidence scores (from predict_proba methods from classifiers) or
binary predictions (only recommended in cases when predict_proba is
not available)
save_path : string
File path for where to save the submission file.
Example:
create_submission(my_confidence_scores, './data/submission.csv')
'''
import pandas as pd
submission = pd.DataFrame({"score":confidence_scores})
submission.to_csv(save_path, index_label="id")
ANSWER
Preliminary Exploration of Data
import warnings
warnings.filterwarnings('ignore')
X_train_original.head()
| hotel | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | ... | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | ... | C | 3 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 |
| 2 | Resort Hotel | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | ... | C | 0 | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 |
| 3 | Resort Hotel | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | ... | A | 0 | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 |
| 4 | Resort Hotel | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | ... | A | 0 | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 |
| 5 | Resort Hotel | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | ... | A | 0 | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 |
5 rows × 29 columns
print(f"The shape of the original training dataset is {X_train_original.shape}, and the shape of the one-hot-encoded training dataset is {X_train_ohe.shape}.")
The shape of the original training dataset is (95512, 29), and the shape of the one-hot-encoded training dataset is (95512, 940).
pd.DataFrame({'Mean': X_train_original.mean(), 'Standard Deviation': X_train_original.std()})
| Mean | Standard Deviation | |
|---|---|---|
| lead_time | 103.849768 | 106.722804 |
| arrival_date_year | 2016.157205 | 0.707470 |
| arrival_date_week_number | 27.152902 | 13.611204 |
| arrival_date_day_of_month | 15.823038 | 8.786777 |
| stays_in_weekend_nights | 0.928491 | 0.999940 |
| stays_in_week_nights | 2.503288 | 1.918017 |
| adults | 1.855746 | 0.596925 |
| children | 0.103696 | 0.397763 |
| babies | 0.007748 | 0.093348 |
| is_repeated_guest | 0.031598 | 0.174929 |
| previous_cancellations | 0.087235 | 0.844491 |
| previous_bookings_not_canceled | 0.140035 | 1.532968 |
| booking_changes | 0.220621 | 0.653900 |
| agent | 86.893778 | 110.839209 |
| company | 188.237117 | 131.459182 |
| days_in_waiting_list | 2.316348 | 17.651287 |
| adr | 101.679313 | 50.906371 |
| required_car_parking_spaces | 0.062673 | 0.246274 |
| total_of_special_requests | 0.571310 | 0.792712 |
Plotting of Data
import matplotlib.pyplot as plt
y_train.hist()
plt.title("Distribution of the target variable")
plt.xlabel("Target variable")
plt.ylabel("Frequency")
Text(0, 0.5, 'Frequency')
fig, axs = plt.subplots((X_train_original.shape[1] + 1) // 2, 2, figsize = (15, 100))
concatData = pd.concat((X_train_original, y_train), axis = 1)
for i, eachColumn in enumerate(X_train_original.columns):
for is_canceled in range(2):
# axs[i // 2][i % 2].hist(concatData.query(f'is_canceled == {is_canceled}')[eachColumn], alpha = 0.5, label = f'is_canceled = {is_canceled}')
concatData.query(f'is_canceled == {is_canceled}')[eachColumn].hist(ax = axs[i // 2][i % 2], alpha = 0.5, label = f'is_canceled = {is_canceled}')
axs[i // 2][i % 2].set_title(f"Distribution of {eachColumn} for \n reservations that were canceled and not", fontsize = 10)
axs[i // 2][i % 2].set_xlabel(eachColumn, fontsize = 10)
axs[i // 2][i % 2].set_ylabel("Frequency", fontsize = 10)
Reviewing Missing Values
X_train_ohe.isna().sum()[X_train_ohe.isna().sum() != 0]
children 2 dtype: int64
Imputing missing children values with the mean
filledData = X_train_ohe.fillna(data['X_train'].loc[:, 'children'].mean())
Creating a Normalized Data Set
normalizedData = (filledData - filledData.mean()) / filledData.std().replace(0, 1)
assert normalizedData['lead_time'].equals((filledData['lead_time'] - filledData['lead_time'].mean()) / filledData['lead_time'].std())
assert normalizedData['customer_type_Transient'].equals((filledData['customer_type_Transient'] - filledData['customer_type_Transient'].mean()) / filledData['customer_type_Transient'].std())
assert normalizedData.isna().sum().sum() == 0
Visualizing Data along Principal Components
from sklearn.decomposition import PCA
import altair as alt
alt.data_transformers.disable_max_rows()
pca = PCA(n_components=2)
pca_X = pca.fit_transform(filledData)
alt.Chart(
pd.concat(
(
pd.DataFrame(
pca_X, columns=["pc_1", "pc_2"], index=filledData.index
),
y_train,
),
axis=1,
), title = "PCA of the one-hot-encoded dataset"
).mark_point().encode(x="pc_1", y="pc_2", color="is_canceled:N")
norm_pca = PCA(n_components=2)
norm_pca_X = norm_pca.fit_transform(normalizedData)
alt.Chart(
pd.concat(
(
pd.DataFrame(
norm_pca_X, columns=["pc_1", "pc_2"], index=normalizedData.index
),
y_train,
),
axis=1,
), title = "Principal Components Analysis on Normalized Data"
).mark_point().encode(x="pc_1", y="pc_2", color="is_canceled:N")
Visualizing Data with t-SNE
from sklearn.manifold import TSNE
charts = []
for my_perplexity in (5, 10, 20, 50, 100, 200, 500, 1000):
tsne = TSNE(n_components = 2, perplexity=my_perplexity)
tsne_X = tsne.fit_transform(filledData[:2000])
charts.append(alt.Chart(
pd.concat(
(
pd.DataFrame(
tsne_X, columns=["tsne_1", "tsne_2"]
),
y_train.reset_index(drop = True)[:2000],
),
axis=1,
), title = f"TSNE with perplexity = {my_perplexity}"
).mark_point().encode(x="tsne_1", y="tsne_2", color="is_canceled:N"))
for eachChart in charts:
display(eachChart)
norm_charts = []
for my_perplexity in (5, 10, 20, 50, 100, 200, 500, 1000):
norm_tsne = TSNE(n_components = 2, perplexity=my_perplexity)
norm_tsne_X = norm_tsne.fit_transform(normalizedData[:2000])
norm_charts.append(alt.Chart(
pd.concat(
(
pd.DataFrame(
norm_tsne_X, columns=["tsne_1", "tsne_2"]
),
y_train.reset_index(drop = True)[:2000],
),
axis=1,
), title = f"t-SNE on normalized data with Perplexity = {my_perplexity}"
).mark_point().encode(x="tsne_1", y="tsne_2", color="is_canceled:N"))
for normChart in norm_charts:
display(normChart)
Setting up Validation set
from sklearn.model_selection import train_test_split
X_train, X_val, y_train_split, y_val = train_test_split(filledData, y_train, test_size = 0.2, random_state = 42)
X_train_norm, X_val_norm, y_train_norm, y_val_norm = train_test_split(normalizedData, y_train, test_size = 0.2, random_state = 42)
Modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
rf = RandomForestClassifier()
rf.fit(X_train, y_train_split)
rf_accuracy = rf.score(X_val, y_val)
rf_roc = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
print(f"The validation accuracy of the random forest model is {rf_accuracy * 100: .2f}%, and the ROC AUC score on the validation set is {rf_roc * 100: .2f}%.")
The validation accuracy of the random forest model is 89.02%, and the ROC AUC score on the validation set is 95.61%.
norm_rf = RandomForestClassifier(n_estimators=200)
norm_rf.fit(X_train_norm, y_train_norm)
norm_rf_accuracy = norm_rf.score(X_val_norm, y_val_norm)
norm_rf_roc = roc_auc_score(y_val_norm, norm_rf.predict_proba(X_val_norm)[:, 1])
print(f"The validation accuracy of the random forest model on the normalized data is {norm_rf_accuracy * 100: .2f}%, and the ROC AUC score on the validation set is {norm_rf_roc * 100: .2f}%.")
The validation accuracy of the random forest model on the normalized data is 88.98%, and the ROC AUC score on the validation set is 95.71%.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, PredefinedSplit
from scipy.stats import loguniform
import numpy as np
mlp = MLPClassifier(hidden_layer_sizes=(100, 100))
# mlp = MLPClassifier()
# # Specify model hyperparameter search space
# param_dist = {
# "learning_rate_init": loguniform(1e-4, 1e-2),
# "alpha": loguniform(1e-7, 10),
# "batch_size": [50, 100, 200],
# "hidden_layer_sizes": [(10, 10), (30, 30), (50, 50), (100,), (100, 100)],
# }
# # For RandomSeachCV, we will need to combine training and validation sets then
# # specify which portion is training and which is validation
# # Also, for the final performance evaluation, train on all of the training AND validation data
# X_train_plus_val = np.concatenate((X_train[:5000], X_val), axis=0)
# y_train_plus_val = np.concatenate((y_train[:5000], y_val), axis=0)
# # Create a predefined train/test split for RandomSearchCV (to be used later)
# validation_fold = np.concatenate((-1 * np.ones(len(y_train[:5000])), np.zeros(len(y_val))))
# train_val_split = PredefinedSplit(validation_fold)
# # run randomized search
# n_iter_search = 50
# random_search = RandomizedSearchCV(
# mlp,
# param_distributions=param_dist,
# n_iter=n_iter_search,
# n_jobs=-1,
# cv=train_val_split,
# )
# random_search.fit(X_train_plus_val, y_train_plus_val)
# mlp.set_params(**random_search.best_params_)
mlp.fit(X_train[:10_000], y_train_split[:10_000])
mlp_accuracy = mlp.score(X_val, y_val)
mlp_roc = roc_auc_score(y_val, mlp.predict_proba(X_val)[:, 1])
print(f"The validation accuracy of the MLP model is {mlp_accuracy * 100: .2f}%, and the ROC AUC score on the validation set is {mlp_roc * 100: .2f}%.")
The validation accuracy of the MLP model is 83.18%, and the ROC AUC score on the validation set is 91.08%.
norm_mlp = MLPClassifier(hidden_layer_sizes=(300, 300), max_iter=1000)
# norm_mlp = MLPClassifier()
# # Specify model hyperparameter search space
# param_dist = {
# "learning_rate_init": loguniform(1e-4, 1e-2),
# "alpha": loguniform(1e-7, 10),
# "batch_size": [50, 100, 200],
# "hidden_layer_sizes": [(10, 10), (30, 30), (50, 50), (100,), (100, 100)],
# }
# # For RandomSeachCV, we will need to combine training and validation sets then
# # specify which portion is training and which is validation
# # Also, for the final performance evaluation, train on all of the training AND validation data
# X_train_plus_val = np.concatenate((X_train_norm[:5000], X_val_norm), axis=0)
# y_train_plus_val = np.concatenate((y_train_norm[:5000], y_val_norm), axis=0)
# # Create a predefined train/test split for RandomSearchCV (to be used later)
# validation_fold = np.concatenate((-1 * np.ones(len(y_train_norm[:5000])), np.zeros(len(y_val))))
# train_val_split = PredefinedSplit(validation_fold)
# # run randomized search
# n_iter_search = 50
# random_search = RandomizedSearchCV(
# norm_mlp,
# param_distributions=param_dist,
# n_iter=n_iter_search,
# n_jobs=-1,
# cv=train_val_split,
# )
# random_search.fit(X_train_plus_val, y_train_plus_val)
# norm_mlp.set_params(**random_search.best_params_)
norm_mlp.fit(X_train_norm[:30_000], y_train_norm[:30_000])
norm_mlp_accuracy = norm_mlp.score(X_val_norm, y_val_norm)
norm_mlp_roc = roc_auc_score(y_val_norm, norm_mlp.predict_proba(X_val_norm)[:, 1])
print(f"The validation accuracy of the MLP model is {norm_mlp_accuracy * 100: .2f}%, and the ROC AUC score on the validation set is {norm_mlp_roc * 100: .2f}%.")
The validation accuracy of the MLP model is 85.56%, and the ROC AUC score on the validation set is 92.79%.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'saga', penalty = 'l1', C = 0.1, max_iter=200)
lr.fit(X_train_norm, y_train_norm)
lr_accuracy = lr.score(X_val_norm, y_val_norm)
lr_roc = roc_auc_score(y_val_norm, lr.predict_proba(X_val_norm)[:, 1])
print(f"The validation accuracy of the logistic regression model on the normalized data is {lr_accuracy * 100: .2f}%, and the ROC AUC score on the validation set is {lr_roc * 100: .2f}%.")
The validation accuracy of the logistic regression model on the normalized data is 83.54%, and the ROC AUC score on the validation set is 91.26%.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_norm, y_train_norm)
knn_accuracy = knn.score(X_val_norm, y_val_norm)
knn_roc = roc_auc_score(y_val_norm, knn.predict_proba(X_val_norm)[:, 1])
print(f"The validation accuracy of the KNN model on the normalized data is {knn_accuracy * 100: .2f}%, and the ROC AUC score on the validation set is {knn_roc * 100: .2f}%.")
The validation accuracy of the KNN model on the normalized data is 82.02%, and the ROC AUC score on the validation set is 87.89%.
filledTest = X_test_ohe.fillna(data['X_train'].loc[:, 'children'].mean())
normalizedTest = (filledTest - filledData.mean()) / filledData.std().replace(0, 1)
assert normalizedTest['lead_time'].equals((filledTest['lead_time'] - filledData['lead_time'].mean()) / filledData['lead_time'].std())
assert normalizedTest['customer_type_Transient'].equals((filledTest['customer_type_Transient'] - filledData['customer_type_Transient'].mean()) / filledData['customer_type_Transient'].std())
assert normalizedTest.isna().sum().sum() == 0
create_submission(norm_rf.predict_proba(normalizedTest)[:, 1], 'submission3.csv')
norm_pca_n = PCA(n_components=X_train_norm.shape[1])
norm_pca_X_n = norm_pca_n.fit_transform(X_train_norm)
norm_pca_val_n = norm_pca_n.transform(X_val_norm)
norm_pca_rf = RandomForestClassifier(n_estimators=200)
norm_pca_rf.fit(norm_pca_X_n, y_train_norm)
norm_pca_rf_accuracy = norm_pca_rf.score(norm_pca_val_n, y_val_norm)
norm_pca_rf_roc = roc_auc_score(y_val_norm, norm_pca_rf.predict_proba(norm_pca_val_n)[:, 1])
print(f"The validation accuracy of the random forest model on the normalized data is {norm_pca_rf_accuracy * 100: .2f}%, and the ROC AUC score on the validation set is {norm_pca_rf_roc * 100: .2f}%.")
The validation accuracy of the random forest model on the normalized data is 87.20%, and the ROC AUC score on the validation set is 94.35%.
Clustering can be used to reveal structure between samples of data and assign group membership to similar groups of samples. This exercise will provide you with experience applying clustering algorithms and comparing these techniques on various datasets to experience the pros and cons of these approaches when the structure of the data being clustered varies. For this exercise, we'll explore clustering in two dimensions to make the results more tangible, but in practice these approaches can be applied to any number of dimensions.
Note: For each set of plots across the five datasets, please create subplots within a single figure (for example, when applying DBSCAN - please show the clusters resulting from DBSCAN as a single figure with one subplot for each dataset). This will make comparison easier.
(a) Run K-means and choose the number of clusters. Five datasets are provided for you below and the code to load them below.
inertia_ attribute of a fit KMeans class instance.(b) Apply DBSCAN. Vary the eps and min_samples parameters to get as close as you can to having the same number of clusters as your choices with K-means. In this case, the black points are points that were not assigned to clusters.
(c) Apply Spectral Clustering. Select the same number of clusters as selected by k-means.
(d) Comment on the strengths and weaknesses of each approach. In particular, mention:
Note: For these clustering plots in this question, do NOT include legends indicating cluster assignment; instead, just make sure the cluster assignments are clear from the plot (e.g. different colors for each cluster)
Code is provided below for loading the datasets and for making plots with the clusters as distinct colors
################################
# Load the data
################################
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons
# Create / load the datasets:
n_samples = 1500
X0, _ = make_blobs(n_samples=n_samples, centers=2, n_features=2, random_state=0)
X1, _ = make_blobs(n_samples=n_samples, centers=5, n_features=2, random_state=0)
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state, cluster_std=1.3)
transformation = [[0.6, -0.6], [-0.2, 0.8]]
X2 = np.dot(X, transformation)
X3, _ = make_blobs(n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state)
X4, _ = make_moons(n_samples=n_samples, noise=.12)
X = [X0, X1, X2, X3, X4]
# The datasets are X[i], where i ranges from 0 to 4
################################
# Code to plot clusters
################################
def plot_cluster(ax, data, cluster_assignments):
'''Plot two-dimensional data clusters
Parameters
----------
ax : matplotlib axis
Axis to plot on
data : list or numpy array of size [N x 2]
Clustered data
cluster_assignments : list or numpy array [N]
Cluster assignments for each point in data
'''
clusters = np.unique(cluster_assignments)
n_clusters = len(clusters)
for ca in clusters:
kwargs = {}
if ca == -1:
# if samples are not assigned to a cluster (have a cluster assignment of -1, color them gray)
kwargs = {'color':'gray'}
n_clusters = n_clusters - 1
ax.scatter(data[cluster_assignments==ca, 0], data[cluster_assignments==ca, 1],s=5,alpha=0.5, **kwargs)
ax.set_xlabel('feature 1')
ax.set_ylabel('feature 2')
ax.set_title(f'No. Clusters = {n_clusters}')
ax.axis('equal')
ANSWER
(a) Plotting the Datasets
fig, axs = plt.subplots(len(X), 1, figsize=(5, 27))
for i, eachDataset in enumerate(X):
axs[i].scatter(eachDataset[:, 0], eachDataset[:, 1], s=5, alpha=0.5)
axs[i].set_title(f'Scatter Plot of Dataset {i+1}')
axs[i].set_xlabel('feature 1')
axs[i].set_ylabel('feature 2')
Plotting Elbow Curves for K Means Clustering of Each Dataset
from sklearn.cluster import KMeans
fig, axs = plt.subplots(len(X), 1, figsize=(5, 27))
for i, eachDataset in enumerate(X):
dissimilarity = []
for eachK in range(1, 11):
dissimilarity.append(KMeans(n_clusters=eachK, random_state=0).fit(eachDataset).inertia_)
axs[i].plot(dissimilarity)
axs[i].set_title(f'Inertia vs. Clusters for Dataset {i+1}')
axs[i].set_xlabel('Clusters')
axs[i].set_ylabel('Inertia')
The selected number of clusters for each dataset is:
Plotting the Clusters
fig, axs = plt.subplots(len(X), 1, figsize=(5, 27))
clusters = (2, 4, 3, 3, 2)
for i, eachDataset in enumerate(X):
plot_cluster(axs[i], eachDataset, KMeans(n_clusters=clusters[i], random_state=0).fit_predict(eachDataset))
(b) Plotting Clusters with DBSCAN
from sklearn.cluster import DBSCAN
fig, axs = plt.subplots(len(X), 1, figsize=(5, 27))
my_eps = (.4, .4, .3, 1, .1)
my_min_samples = (35, 15, 20, 40, 10)
for i, eachDataset in enumerate(X):
plot_cluster(axs[i], eachDataset, DBSCAN(eps=my_eps[i], min_samples = my_min_samples[i]).fit_predict(eachDataset))
(c) Apply Spectral Clustering
from sklearn.cluster import SpectralClustering
fig, axs = plt.subplots(len(X), 1, figsize=(5, 27))
clusters = (2, 4, 3, 3, 2)
for i, eachDataset in enumerate(X):
plot_cluster(axs[i], eachDataset, SpectralClustering(n_clusters=clusters[i], random_state=0).fit_predict(eachDataset))
(d) Reviewing the Clustering Techniques
For dataset 1, both K Means and Spectral Clustering worked equally well, but K Means is computationally less expensive. DBSCAN performed worse, leaving many points out of an expected cluster, and required substantially more hyperparameter tuning. K Means and Spectral Clustering required much less hyperparameter tuning.
(a) Reduce the dimensionality of the data with PCA for data visualization. Load the scikit-learn digits dataset (code provided to do this below). Apply PCA and reduce the data (with the associated cluster labels 0-9) into a 2-dimensional space. Plot the data with labels in this two dimensional space (labels can be colors, shapes, or using the actual numbers to represent the data - definitely include a legend in your plot).
(b) Create a plot showing the cumulative fraction of variance explained as you incorporate from $1$ through all $D$ principal components of the data (where $D$ is the dimensionality of the data).
You can use the explained_variance_ attribute of the PCA module in scikit-learn to assist with this question
(c) Reduce the dimensionality of the data with t-SNE for data visualization. T-distributed stochastic neighborhood embedding (t-SNE) is a nonlinear dimensionality reduction technique that is particularly adept at embedding the data into lower 2 or 3 dimensional spaces. Apply t-SNE using the scikit-learn implementation to the digits dataset and plot it in 2-dimensions (with associated cluster labels 0-9). You may need to adjust the parameters to get acceptable performance. You can read more about how to use t-SNE effectively here.
(d) Briefy compare/contrast the performance of these two techniques.
fit method and a fit_transform method, these methods are actually identical, and there is no transform method. Why is this? What implications does this imply for using this method?Note: Remember that you typically will not have labels available in most problems.
Code is provided for loading the data below.
################################
# Load the data
################################
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# load dataset
digits = datasets.load_digits()
n_sample = digits.target.shape[0]
n_feature = digits.images.shape[1] * digits.images.shape[2]
X_digits = np.zeros((n_sample, n_feature))
for i in range(n_sample):
X_digits[i, :] = digits.images[i, :, :].flatten()
y_digits = digits.target
ANSWER
(a) Plotting Digits with Principal Component Analysis
import altair as alt
import pandas as pd
alt.Chart(
pd.DataFrame(
np.hstack(
(PCA(n_components=2).fit_transform(X_digits), y_digits.reshape(-1, 1))
),
columns=["PC1", "PC2", "Digits"],
), title = "PCA of Digits Dataset"
).mark_circle().encode(x="PC1", y="PC2", color="Digits:N")
(b) Plotting proportion of variance explained by dimensions with PCA
explainedVarianceRatio = PCA(n_components=64).fit(X_digits).explained_variance_ratio_
plt.plot(explainedVarianceRatio.cumsum())
plt.title('Cumulative Explained Variance Ratio for PCA on Digits Dataset')
plt.xlabel('Dimensions')
plt.ylabel('Cumulative Explained Variance Ratio')
Text(0, 0.5, 'Cumulative Explained Variance Ratio')
print(f"The proportion of the variance explained by the first two principal components is {explainedVarianceRatio[:2].sum() *100:.2f} %.")
The proportion of the variance explained by the first two principal components is 28.51 %.
Since only a small portion of the variance is explained by the first two principal components, clustering in two dimensions will under-represent how well the data clusters.
(c) Plotting Clusters with t-SNE
alt.Chart(
pd.DataFrame(
np.hstack(
(TSNE(n_components=2, perplexity = 5).fit_transform(X_digits), y_digits.reshape(-1, 1))
),
columns=["t-SNE Feature 1", "t-SNE Feature 2", "Digits"],
), title = "t-SNE of Digits Dataset"
).mark_circle().encode(x="t-SNE Feature 1", y="t-SNE Feature 2", color="Digits:N")
(d) Comparing PCA with t-SNE
t-SNE appears to perform much better than PCA on this dataset, because it manages to separate the clusters much better. This is due to the fact that t-SNE is a non-linear dimensionality reduction, while PCA is a linear dimensionality reduction, and the features of the digits are non-linear (because it is image data).
t-SNE does not have a transform function, because it's non-linear transform requires all of the data to complete its dimensionality reduction. This means that if you had a validation set, or any other set, that the t-SNE was not fit on, but that you wanted to visualize in that space, you could not do this, because it would need to be fit with that data from the start.